Abstract

The application of cognitive mechanisms to support knowledge acquisition is, from
our point of view, crucial for making the resulting models coherent, efficient, credible,
easy to use and understandable. In particular, there are two characteristic features
of intelligence that are essential for knowledge development: forgetting and consolidation.
Both play an important role in knowledge bases and learning systems to avoid
possible information overflow and redundancy, and to preserve and strengthen
important or frequently used rules and remove (or forget) useless ones. We present an
incremental, long-life view of knowledge acquisition which tries to improve task after
task by determining what to keep, what to consolidate and what to forget, overcoming
the stability-plasticity dilemma [1]. To do so, we rate rules by introducing
several metrics through the first adaptation, to our knowledge, of the Minimum
Message Length (MML) principle [2] to a coverage graph, a hierarchical assessment
structure which treats evidence and rules in a unified way. The metrics are used not only
to forget some of the worst rules, but also to drive a consolidation process that
promotes selected rules to the knowledge base, which is mirrored by a demotion
system. We evaluate the framework with a series of tasks in a chess rule learning
domain.

Keywords: Cognitive abilities, forgetting, consolidation, lifelong machine learning, 
knowledge acquisition, declarative learning, MML. 


1 Introduction 

Machine learning and other data analysis techniques are becoming crucial for many
applications where we want to turn (big) data into knowledge. However, any conception of
knowledge discovery that aims at generating more insightful results must overhaul the whole
process with an incremental, developmental perspective. The view can no longer be a
transformation from data to knowledge, but a transformation of knowledge (plus data) into new
knowledge. As a result, properly representing, revising, evaluating, organising and
retrieving previous knowledge is crucial in this quest for more complex, insightful, powerful and
ultimately cognitive approaches that make knowledge discovery an incremental process.

Knowledge acquisition¹, understood as an automated process of abstracting knowledge
from facts and other knowledge, cannot be reduced to a naive accumulation of what is
being learned. It should be checked whether newly learned knowledge is redundant,
irrelevant or inconsistent with old knowledge, and whether it can be built upon previously acquired
knowledge. From our point of view, knowledge acquisition systems should be developed for
this purpose. This leads us to one of the well-known constraints for AI systems: the
stability-plasticity dilemma [1]. The basic idea is that an AI system must be capable of learning new
things (plasticity) without losing previously learned concepts (stability). This has been a
design principle mainly investigated within the perspective of neural computation over
the last thirty years. Some of the proposed solutions include: (a) dual-memory systems
simulating the presence of short and long-term memory Mi, and (b) cognitive
architectures such as the Adaptive Resonance Theory (ART) 1 emulating how the brain processes
information. In both cases, catastrophic forgetting [^] of previously learned information is
effectively overcome. However, those approaches are only able to gain new knowledge
(forgetting is not allowed) without proper management of existing knowledge, which reduces
the versatility and efficiency of the proposals.

From our point of view, the above principle should point the way to a more general
principle which also applies to general AI systems for knowledge acquisition. It could be used
to define “truly” intelligent systems (a) able to support incremental knowledge acquisition
without the need to be discarded and retrained repeatedly (which is not cost-effective), (b)
where the inductive and deductive reasoning algorithms are integrated for such a goal and
guided by knowledge evaluation metrics, and, finally, (c) able to focus on what is relevant
knowledge (or, dually, to discard what is not) by the use of cognitive mechanisms that simplify
the learning of new knowledge. With those requirements in mind, we review below some prior
work in the area of knowledge acquisition.

Over the last decades, there has been extensive work on growing knowledge bases
from discovered patterns and rules. We find this in different areas, including expert systems,
machine learning, cognitive science, nonmonotonic logic, information systems and inductive
(logic) programming. For instance, Lifelong Machine Learning (LML) [7] is concerned with
the persistent and cumulative nature of learning, namely with systems (a) capable of retaining and
using prior knowledge, and (b) capable of acquiring new knowledge over a series of prediction
tasks. Transfer Learning 0 and multitask learning m [10] take a similar
perspective, where it is more explicit that the process is task-oriented, and knowledge and its
structure do not always play a central role in these systems. ELLA (Efficient Lifelong
Learning Algorithm) [11] and NELL (Never-Ending Language Learner) [12] are two more
recent approaches to LML, which are able to integrate many capabilities. However, it is not
easy to export or derive general principles from these works to analyse a knowledge base and
help in a general incremental knowledge discovery process.

1 In expert systems, the term knowledge acquisition is usually understood as the incorporation of expert
knowledge into the system. In this paper, we use the term knowledge acquisition as the process of discovering
new knowledge from facts and integrating it with the existing knowledge.

2 Phenomenon by which neural networks completely forget previously learned information when exposed
to new information.

Other related topics are concept drift and theory revision mmm, where some rules
are replaced by new rules that are consistent with new experience. This is similar to the
approach in nonmonotonic and approximate reasoning, and probabilistic or stochastic logic
representations. The areas of inductive logic programming [16, 17] or general inductive
programming [18, 19] have seen several approaches for incremental [20] or cumulative systems

FT 

A crucial aspect is theory and knowledge evaluation. When the theory or
hypothesis is considered as a whole and separated from the evidence, we have many well-founded
proposals, such as the MML principle [21, 22] or the similar (but later) MDL principle
[23, 24]. However, for knowledge integration and consolidation it is necessary to assess each
part of the theory independently, since different parts of the theory can have different
degrees of validity, probability or reinforcement [25, 26]. Even then, there is still a separation
between knowledge and evidence. It would be meaningful to provide a full integration
of knowledge and evidence into a hierarchical assessment structure, from very specific and
ground facts to more abstract rules. The perspective of a network or hierarchy of nodes that
get support from other nodes is more common in the area of link analysis in web graphs,
with algorithms such as HITS [27], PageRank [28] or SALSA [29], or in informetrics.

Finally, knowledge acquisition has much to learn from the study of human cognition [30].
We can fully realise the benefits of knowledge acquisition by paying attention
to the cognitive factors that simplify the learning and processing of knowledge and
make the resulting models coherent, efficient, credible, easy to use and understandable [35].
In particular, there is a characteristic feature of intelligence that is essential for knowledge
development: forgetting. While human memory has a positive connotation linked with
performance, forgetting is often described in negative terms, as a state where memory
does not work properly. Memory and forgetting are two complementary faces of the same
biological process (synaptic plasticity), the latter being one of the human mind’s selective
activities which allow us to abstract concepts. It could be said that, without forgetting,
memory would be completely useless. The absence of forgetting was masterfully described
by Jorge Luis Borges in his tale “Funes, the Memorious” (1942): “To think is to forget
a difference, to generalise, to abstract. In the overly replete world of Funes, there were
nothing but details”. Clearly, remembering absolutely everything prevents
abstract thought (the process of generalisation), and induction and deduction rely
on this ability. Therefore, in AI systems, forgetting should play an important role when
acquiring knowledge. Forgetting has multiple shades of meaning in AI systems: it can refer to
a complete and irreversible elimination of significant old knowledge while learning new knowledge,
or it can denote that newly learned knowledge is not always kept in the working memory but
abstractly encoded by identifying its relation to abstract concepts already present in the
knowledge base. The first meaning clearly refers to those AI systems that are booted up
for solving individual problems, whereas the latter is the desired one: forgetting
should exist in knowledge bases and learning systems to avoid possible information overflow
and redundancy, and to preserve and strengthen important or frequently used rules
and remove (or forget) useless ones.

The ability to discard what is not relevant is becoming more important
not only in cognitive science and neuroscience [36], but also in artificial intelligence (e.g.,
reasoning, planning, decision making). The notion of forgetting, also known as variable
elimination, has been widely investigated in the context of classical logic (propositional
and first-order logic) [37, 38, 39] and developed under the notion of logical equivalence,
that is, logically equivalent formulas (theories) will remain equivalent after forgetting the
same set of propositional variables or literals. A similar approach, but for reasoning from
inconsistent propositional bases, is proposed in [39]. Recently, the concept of forgetting has
spread to other non-classical logic systems from various perspectives, such as in logic
programs unumsa, where a semantic forgetting is used instead of developing a number of
criteria for forgetting atoms; in modal logic [43, 44, 45], where variable elimination is applied
in the context of intelligent agents; and in description logics (DLs) [IE] for omitting concepts
and roles in knowledge bases. Forgetting (abstracting from) actions in planning has also been
investigated in m-. Finally, [48] proposes a forgetting mechanism for an online learning
algorithm that learns sequential data with timelines and is able to gradually expel outdated data
that could become a source of misleading information.

From our point of view, forgetting in a knowledge base is closely linked to the previous
concept of theory and knowledge evaluation. Therefore, inspired by the MML principle,
the informativeness of a piece of knowledge (in terms of usefulness or the opposite concept,
irrelevance) can be assessed quantitatively solely by the relationship between its complexity and
the compression it achieves. This leads us to a simple and general concept of forgetting where as much
information as possible from the original knowledge is preserved, thus setting aside tasks
such as the preservation of logical equivalences or the satisfaction of semantic properties
between theories.

Closely related to the above concept we find memory consolidation, namely, the
neurological process of converting information from short-term memory into long-term memory.
Some studies about episodic memory in humans mm claim that memory traces in the
hippocampus are not permanent and are occasionally transferred to neocortical areas in
the brain through a consolidation process. This consolidation process refers to the idea
that memories continue to strengthen after they have been formed in the human brain, and it
seems a primary factor underpinning memory and forgetting in knowledge bases and
learning systems. Notwithstanding that a single recent cognitive model of memory ascribes too much
importance to consolidation procedures m, we consider not only that forgetting must be
a prevalent operation in knowledge acquisition, but also that consolidation is crucial for
promoting efficient memory storage.

Given the above overview, we see that it is not easy to develop a new knowledge discovery
system that is meant to be cumulative. In fact, this research started when developing
our system gErl [52, 53]. We were looking for a proper foundation for detailed knowledge
assessment metrics and criteria for forgetting. The need to make general principles available
for our system and other systems motivated the current work.

In this work we take the most general approach by considering that we start with an
off-the-shelf inductive engine (e.g., a rule learner, an inductive logic programming (ILP) system [16,
17] or an inductive programming (IP) system [18, 19]) and an off-the-shelf deductive engine
(e.g., a coverage checker, an automated deduction system or a declarative programming
language) and, over them, we build a long-life knowledge discovery system (see Figure 1).

Figure 1: Architecture of a long-life knowledge discovery approach.

For this purpose, several issues have to be addressed:


1. The inductive engine can generate many possible hypotheses and patterns. Once 
brought to working memory we require metrics to evaluate how these hypotheses be¬ 
have and how they are related in the context of previous knowledge. Additionally, at 
any time new evidence can be added as rules to the working space. 

2. As working memory and computational time are limited, we need a forgetting criterion 
to discard some rules which are considered irrelevant in terms of informativeness. 

3. The deductive engine checks the coverage of each hypothesis independently, using the
background or consolidated knowledge as auxiliary rules, but not other working rules.
As a result, only when new knowledge is consolidated can we use it for new problems
or for more difficult examples of the same problem. This means that deduction is
“modulo the background knowledge”. In other words, working hypotheses must be
able to use consolidated knowledge but not other working rules.

4. The promotion of rules into consolidated knowledge must avoid unnecessarily large
knowledge bases and the consolidation of rules that are useless, too preliminary or
inconsistent. Also, if the knowledge
base becomes too large, finding the appropriate pieces of knowledge for new tasks will
be less efficient. This means that rules must be promoted and demoted to keep a powerful,
but still manageable knowledge base.

The idea of a coverage graph is used as the basis for structuring knowledge, and its construction is delegated to
the deductive engine. The generation of new rules is delegated to the inductive engine. The
crucial part is the definition of appropriate metrics to guide the way knowledge develops.
For this purpose, the MML principle is used as a sound theoretical ground for the metrics.

The paper is organised as follows. Section 2 introduces the notion of coverage graph,
which is our setting for a knowledge base. Over this coverage graph, we are able to introduce
an adaptation of the MML principle and related metrics in Section 3. Section 4 deals with
knowledge structuring: how rules are forgotten, promoted and demoted. We include several
experiments illustrating how knowledge consolidation and forgetting work in Section
5. Finally, Section 6 closes the paper with the contributions and some future work.


2 Coverage graph 

We consider that ‘rules’ are used for expressing examples, hypotheses and background
knowledge. Rules are denoted as p, where class(p) = c, c ∈ C and C is the set of classes, such
as {false, true}. The set of all possible rules is denoted by R, where W ⊂ R is the working
space or memory, and K ⊂ R is the background or consolidated knowledge base.

Rules are represented as vertices or nodes V (we use both terms interchangeably) in a directed
acyclic graph G(V, A) that we call a coverage graph (the DAG representation of a specific
working space), because the directed edges A represent the coverage relation between the
different rules as determined by the deductive engine. We say that a rule p_a is covered by
another rule p_b if (K ∪ p_b) ⊨ p_a. The precise understanding of the semantic consequence
operator ⊨ will depend on the rule representation language used and the deductive engine.
Hence, if there is an edge a = (μ, ν) (or μ → ν), then ν is said to be directly covered by μ
using K.³

The sets of ancestors and successors of a node are defined as anc(ν) = {μ | μ → ν} and
suc(μ) = {ν | μ → ν}, respectively. Also, we distinguish two subsets of nodes: leaves, nodes
without successors (|suc(ν)| = 0), where the set of leaves of class c is denoted as leaves_c;
and roots, nodes without ancestors (|anc(ν)| = 0).
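
The definitions above can be sketched in code. The following minimal Python sketch is only an illustration: the class name `CoverageGraph`, its methods and the toy edges are assumptions made here, not part of the system described in this paper, and the coverage edges are taken as given rather than computed by a deductive engine.

```python
from collections import defaultdict


class CoverageGraph:
    """DAG whose edge (mu, nu) means 'mu directly covers nu' (illustrative only)."""

    def __init__(self, edges, nodes=()):
        self.nodes = set(nodes)
        self._suc = defaultdict(set)  # mu -> nodes directly covered by mu
        self._anc = defaultdict(set)  # nu -> nodes that directly cover nu
        for mu, nu in edges:
            self.nodes.update((mu, nu))
            self._suc[mu].add(nu)
            self._anc[nu].add(mu)

    def suc(self, mu):
        """suc(mu) = {nu | mu -> nu}."""
        return set(self._suc.get(mu, ()))

    def anc(self, nu):
        """anc(nu) = {mu | mu -> nu}."""
        return set(self._anc.get(nu, ()))

    def leaves(self):
        """Nodes without successors, i.e. |suc(nu)| = 0."""
        return {v for v in self.nodes if not self.suc(v)}

    def roots(self):
        """Nodes without ancestors, i.e. |anc(nu)| = 0."""
        return {v for v in self.nodes if not self.anc(v)}


# Hypothetical toy graph: g covers s1 and s2, which in turn cover the examples e1 and e2.
toy = CoverageGraph([("g", "s1"), ("g", "s2"), ("s1", "e1"), ("s2", "e1"), ("s2", "e2")])
```

Storing both directions of each edge keeps `anc` and `suc` cheap to query, which matters because the support metrics introduced later need both.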

Figure 2 shows an example of the coverage graph of a well-known ILP problem [16]: the
family relationship. In this problem, the task is to define the target relation daughter(X, Y),
which states that person X is the daughter of person Y. W consists of three positive examples
(rules 1, 2 and 5), two negative ones (rules 3 and 4), and seven selected rules that try to
generalise and solve the problem (Table 1, right), whereas K is composed of the relations
female and parent (Table 1, left). Note that the rules in K have not been included in the
graph for clarity, although they belong to the initial “consolidated knowledge”.


3 Basic Metrics for Discovered Knowledge Assessment 

In order to select and arrange the set of rules in the working space, various measures of
usefulness, relevance and consistency have to be derived from the coverage graph. Based on
the idea that the relevance or usefulness of a rule can be stated by the relationship between
its own complexity and the complexity of the rules it covers, a general criterion such as
the Minimum Message Length (MML) [2] can be used as a starting criterion from which to
derive new metrics.

3 For simplicity, the coverage graphs do not include the edges for the transitive closure of the covering
relation, i.e., if a node μ covers nodes ν and γ, but ν also covers γ, only the edges μ → ν and ν → γ are
included in the graph.

Figure 2: Coverage graph of the family relations problem. Green and red nodes refer to
positive and negative examples respectively. The graph shows rule IDs according to Table 1.

Background Knowledge        Rules
ID   Rule                   ID    Rule
k1   parent(ann,mary).      1     daughter(mary,ann).
k2   parent(ann,tom).       2     daughter(eve,tom).
k3   parent(tom,eve).       3     daughter(tom,ann).
k4   parent(tom,ian).       4     daughter(eve,ann).
k5   female(ann).           5     daughter(cris,tom).
k6   female(mary).          100   daughter(X,Y):- female(Y),parent(Y,mary).
k7   female(eve).           59    daughter(eve,tom):- female(eve),parent(tom,eve).
                            20    daughter(eve,tom):- female(eve).
                            35    daughter(eve,Y):- female(eve).
                            73    daughter(X,tom):- female(X),parent(tom,X).
                            110   daughter(X,Y):- female(X),parent(Y,X).
                            138   daughter(V,W):- female(X),parent(Y,Z).

Table 1: Left: Background Knowledge for the family relations problem. Right: Rules of this
problem in Prolog notation.


3.1 Minimum Message Length 

The Minimum Message Length is one of the most popular selection criteria in inductive
inference (for a formal justification and its relation to Kolmogorov complexity and the related
MDL principle, see [54, 55, 22]). It provides an interpretation of the Occam’s Razor principle:
the model generating the shortest overall message (composed of the model and the evidence
concisely encoded using it) is more likely to be correct. This message can be restated in
Bayesian form [2], with the length of the first part of the message (the model) and the length
of the second part (the evidence covered). Bayes' theorem, which is the primary concern
of Bayesian inference, is shown in equation (1):


$$P(H|E) = \frac{P(H) \cdot P(E|H)}{P(E)} = \frac{P(H \cap E)}{P(E)} \tag{1}$$


where P(H) is the prior probability of the model H, P(E|H) is the likelihood, and P(E) is the
probability of the evidence E. An information-theoretic interpretation of MML is that a given
evidence E of probability P(E) can be coded by a message of length L(E) = -log2(P(E))
[56]. Therefore, taking the negative logarithm of expression (1) and according to the MML
philosophy, the length of a hypothesis H given a fixed evidence E, L(H|E), is defined in
terms of three simple heuristics: a complexity-based heuristic (which measures the complexity
of H), a coverage heuristic (which measures how much extra information is necessary to
express the evidence given the hypothesis H) and the length of the evidence, L(E), which is
equal for all competing hypotheses:


$$L(H|E) = L(H) + L(E|H) - L(E) \tag{2}$$

By minimising equation (2) we maximise the posterior probability. This involves searching
for the model that gives the shortest message.
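
As a small numerical illustration of this equivalence, the following Python sketch computes L(H|E) for three competing hypotheses and selects the shortest message. The hypothesis names, priors and likelihoods are invented for the example, and P(E) is obtained by assuming the three hypotheses are exhaustive and mutually exclusive.

```python
import math


def length_bits(prob):
    """Code length in bits of an event with probability `prob`: L = -log2(P)."""
    return -math.log2(prob)


def mml_length(prior, likelihood, evidence_prob):
    """L(H|E) = L(H) + L(E|H) - L(E), i.e. -log2 of the posterior P(H|E)."""
    return length_bits(prior) + length_bits(likelihood) - length_bits(evidence_prob)


# Hypothetical hypotheses: name -> (prior P(H), likelihood P(E|H)).
hypotheses = {"H1": (0.5, 0.1), "H2": (0.25, 0.5), "H3": (0.25, 0.2)}

# Assumption: the hypotheses are exhaustive and exclusive, so P(E) is their weighted sum.
p_evidence = sum(prior * lik for prior, lik in hypotheses.values())

lengths = {h: mml_length(prior, lik, p_evidence) for h, (prior, lik) in hypotheses.items()}
best = min(lengths, key=lengths.get)  # shortest message = highest posterior
```

Since L(E) is the same for every hypothesis, dropping it would not change which hypothesis is selected; it is kept here only so that 2 raised to -L(H|E) recovers the posterior.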

Apart from its connection with Kolmogorov complexity and Solomonoff induction,
which gives additional support for its use, the MML principle (and the similar MDL principle)
has been successfully applied in many areas of machine learning, AI and cognitive science.
However, to our knowledge, the MML principle has always been applied to select between
hypotheses with respect to some given evidence. In our case, we have a coverage graph where
rules cover other rules, so they become H and E at the same time. In a way, what we need
is a hierarchical application of MML. With this in mind, the MML principle can be adapted to
our approach with the following consideration: instead of measuring the length
of a hypothesis H given fixed evidence E, what we want to measure is the length of each
rule p in W with respect to the rest of the rules in W (which includes examples and hypotheses),
because p can model not only examples, but also other rules. Therefore, L(p|W) is defined
as the sum of the length of p, L(p), and the length necessary to express the rules in W - {p}
not modelled by p, L(W|p), minus the length of all the rules in W, L(W). Formally:




$$L(p|W) = L(p) + L(W|p) - L(W) \tag{3}$$

At first sight, this is just a notational change with respect to equation (2). This is only true for the first
term, which is estimated in the same way as in the original MML principle. The term L(p) can
be defined in different ways depending on the rule representation language. For instance, if
we are using logical or functional rules (as in the family example), we could use the following
approximation. Given Σ, a set of m_Σ functor symbols of arity > 0, and X, a set of m_X
variables, we could define the length of a rule p containing n_Σ functors and n_X variables as


L(p) = m-£ log 2 (ns + 1) 

3 y l°g2( n * + 1) 


(4) 


Note that we promote variables over constants or functors. 

Table 2 shows the length in bits and the class for the rules in the graph of Figure 2.


ID    L(p)     class
1     17.844   +
2     17.844   +
3     17.844   -
4     17.844   -
5     17.844   +
100   11.977
59    20.036
20    11.591
35    9.284
73    13.114
110   9.962
138   12.462

Table 2: Length and class for the rules on the right side of Table 1.


3.2 MML goes hierarchical: Support 

Starting from equation (3), we now reinterpret the terms L(W|p) - L(W) so that they can be
applied to coverage graphs and multiclass settings. Roughly speaking, these terms capture
the “net profit” of a rule in terms of support or coverage (length in bits of the rules
covered). More formally, we define the support of a rule p ∈ W as:

$$S(p, W) = L(p) - L(p|W) = L(W) - L(W|p) \tag{5}$$

where L(W) - L(W|p) represents the coverage of a rule p expressed in bits, that is, the
length of all the rules in W minus the length of the rules not covered by p. Therefore, the
support of a rule p represents the length of the rules it covers:

$$S(p, W) = \sum_{\nu : p \models \nu} L(\nu) \tag{6}$$

leading to an alternative expression for L(p|W) (equation (3)) in terms of support:

$$L(p|W) = -S(p, W) + L(p) \tag{7}$$

which establishes that by maximising S(p, W) and minimising L(p) we minimise L(p|W). This
involves searching for the rule p that covers the maximum number of rules and has the lowest
length.
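
The trade-off between support and length can be illustrated with a short Python sketch. The rule names, lengths and coverage sets below are hypothetical, and coverage is assumed to be given explicitly (including transitively covered rules), which is the setting of equations (6) and (7) before any adaptation to coverage graphs.

```python
def support(rule, covers, length):
    """S(p, W): total length in bits of the rules covered by `rule` (eq. 6)."""
    return sum(length[v] for v in covers.get(rule, ()))


def length_given_workspace(rule, covers, length):
    """L(p|W) = -S(p, W) + L(p) (eq. 7); lower values are better."""
    return -support(rule, covers, length) + length[rule]


# Hypothetical lengths (bits) and coverage sets, for illustration only.
length = {"e1": 10.0, "e2": 10.0, "e3": 10.0, "r_small": 8.0, "r_big": 9.0}
covers = {"r_small": {"e1"}, "r_big": {"e1", "e2", "e3", "r_small"}}

# Rank the candidate rules from best (lowest L(p|W)) to worst.
ranking = sorted(covers, key=lambda r: length_given_workspace(r, covers, length))
```

In this toy setting `r_big` is slightly longer than `r_small`, but its much larger support gives it the lower L(p|W), so it is ranked first.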

The next step is to adapt equation (7) to coverage graphs that do not explicitly
include the edges for the transitivity of the coverage relation. In order to account for the
upwards propagation, only the leaves have an initial support value, which is equal to their
length in bits, and the rest of the nodes obtain theirs recursively from the nodes they directly cover.
Thus, the new support S'(p, W) adapted to work on coverage graphs is defined as:

$$S'(p, W) = \begin{cases} L(p) & \text{if } p \in leaves \\ \sum_{\nu \in suc(p)} S'(\nu, W) & \text{otherwise} \end{cases} \tag{8}$$

In order to avoid the scenario where the less grounded (upper) nodes get higher and 
higher support values, the support measure is required to satisfy a conservative condition. 
This property is somehow related to the law of conservation of energy, implying that at any 
node in a coverage graph, the sum of the total support flowing into that node is equal to the 
sum of the total support flowing out of that node. 

Now, to make S' conservative we need to divide the support of
a specific node ν by |anc(ν)| in order to distribute the support of ν equally among all of
its ancestors.

Therefore, the new formula used to calculate the support of a rule, S(p, W), is defined
as:

$$S(p, W) = \begin{cases} L(p) & \text{if } p \in leaves \\ \sum_{\nu \in suc(p)} \frac{S(\nu, W)}{|anc(\nu)|} & \text{otherwise} \end{cases} \tag{9}$$

leading to an expression for L(p|W) in terms of this conservative support:

$$L(p|W) = -S(p, W) + L(p) \tag{10}$$

Equation (9) now satisfies the required conservative condition, which can be stated
as follows: the support of a node (which depends on its successors) always has to be entirely
allocated to its ancestors, together with the support inherited from other covered nodes (see

This implies (but not vice versa) that the total sum of the support in the leaves of the
coverage graph is equal to the total sum of the support at the root nodes. Namely:

$$\sum_{p \in leaves} S(p, W) = \sum_{\nu \in roots} S(\nu, W) \tag{11}$$

For each leaf in the coverage graph there are n different paths along which the support flows
upwards to the root nodes. Whenever a path forks (a node has more than one ancestor), the support is
divided by the number of outgoing paths, so that each ancestor receives an equal part
of the support, and the roots receive a proportion of the original support of the leaves
transitively covered by them. Therefore, if we assume that the total support at the roots is
different from the total support at the leaves, it means that an external transfer of support
(which comes from or goes to other sources) has happened. However, according to equation (9),
this is not possible and, therefore, the total sum of the support at the roots always remains
constant and equal to the total support at the leaf nodes (see Figure 3).



Figure 3: Graphical representation of the flow of the support through the
coverage graph using equation (9): the support of each leaf node is always allocated to the roots. Therefore,
the total support of the leaves is equal to the total support at the roots.

Example

In the example of Figure 3, and according to equation (9), the support
of the root nodes is

$$S(d, W) = S(e, W) = \frac{S(x, W)}{3}, \qquad S(f, W) = \frac{S(x, W)}{3} + S(y, W),$$

where

$$S(x, W) = S(a, W) + \frac{S(b, W)}{2}, \qquad S(y, W) = \frac{S(b, W)}{2} + S(c, W),$$

and

$$S(a, W) = L(a), \quad S(b, W) = L(b), \quad S(c, W) = L(c),$$

thus making the following equations true (according to equation (9)):

$$\underbrace{S(a, W) + \frac{S(b, W)}{2}}_{S(x, W)} - \underbrace{S(d, W)}_{S(x, W)/3} - \underbrace{S(e, W)}_{S(x, W)/3} - \underbrace{\left(S(f, W) - S(y, W)\right)}_{S(x, W)/3} = 0$$

$$\underbrace{\frac{S(b, W)}{2} + S(c, W)}_{S(y, W)} - \underbrace{\left(S(f, W) - \frac{S(x, W)}{3}\right)}_{S(y, W)} = 0$$

and, also, making the total support at the leaf nodes (L(a) + L(b) + L(c)) equal to the total support at the root
nodes (equation (11)):

$$\begin{aligned}
S(d, W) + S(e, W) + S(f, W) &= \frac{S(x, W)}{3} + \frac{S(x, W)}{3} + \left(\frac{S(x, W)}{3} + S(y, W)\right) \\
&= S(x, W) + S(y, W) \\
&= \left(S(a, W) + \frac{S(b, W)}{2}\right) + \left(\frac{S(b, W)}{2} + S(c, W)\right) \\
&= S(a, W) + S(b, W) + S(c, W)
\end{aligned}$$
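
The worked example can be checked numerically. The Python sketch below is a minimal illustration of the conservative propagation of equation (9); the function name, the dictionary-based graph encoding and the leaf lengths are assumptions made for the example, and the edges are those implied by the formulas of the example (x covers a and b, y covers b and c, d and e cover x, f covers x and y).

```python
def conservative_support(suc, length):
    """Support of every node following eq. (9).

    suc    -- dict mapping each node to the list of nodes it directly covers
    length -- dict mapping each leaf to its length L in bits
    """
    # anc[v] = nodes that directly cover v, derived from suc.
    anc = {v: [] for v in suc}
    for mu, covered in suc.items():
        for nu in covered:
            anc[nu].append(mu)

    support = {}

    def s(node):
        if node not in support:
            if not suc[node]:
                # Leaf: its support is its own length.
                support[node] = length[node]
            else:
                # Each successor shares its support equally among its ancestors.
                support[node] = sum(s(nu) / len(anc[nu]) for nu in suc[node])
        return support[node]

    for node in suc:
        s(node)
    return support, anc


suc = {
    "d": ["x"], "e": ["x"], "f": ["x", "y"],
    "x": ["a", "b"], "y": ["b", "c"],
    "a": [], "b": [], "c": [],
}
length = {"a": 12.0, "b": 6.0, "c": 9.0}  # hypothetical lengths in bits

support, anc = conservative_support(suc, length)
roots_total = sum(support[v] for v in suc if not anc[v])
leaves_total = sum(support[v] for v in suc if not suc[v])
```

With these lengths, x and y receive 15 and 12 bits, the roots d, e and f receive 5, 5 and 17 bits, and both totals equal 27 bits, as equation (11) requires.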


Finally, since the working space W can accommodate
examples of different classes, we need our metric to distinguish between them. Hence,
there are as many support values for each node as there are classes in the
working space, each one satisfying the conservative property and formally defined as:

$$S_c(p, W) \triangleq \begin{cases} L(p) & \text{if } p \in leaves_c \\ \sum_{\nu \in suc(p)} \frac{S_c(\nu, W)}{|anc(\nu)|} & \text{otherwise} \end{cases} \tag{12}$$

with equation (10) being defined for classes as follows:

$$-L_c(p|W) = S_c(p, W) - L(p) \tag{13}$$

The value of L_c is interpreted as the hierarchical version of the MML principle, with L_c being
the lower the better (and obviously -L_c the higher the better).

Continuing with the family example, Table 3 shows the support and the negated
L(p|W) (for each class) of the rules in the graph in Figure 2.


ID    L(p)     class   S+       S-       -L+       -L-
1     17.844   +       17.844   0.0      0.0       -17.844
2     17.844   +       17.844   0.0      0.0       -17.844
3     17.844   -       0.0      17.844   -17.844   0.0
4     17.844   -       0.0      17.844   -17.844   0.0
5     17.844   +       17.844   0.0      0.0       -17.844
100   11.977           8.922    26.766   -3.0549   14.788
59    18.791           8.922    0.0      -11.114   -20.036
20    11.591           8.922    0.0      -2.668    -11.591
35    9.284            8.922    8.922    -0.362    -0.362
73    13.114           26.766   0.0      13.651    -13.114
110   9.962            35.688   0.0      25.726    -9.962
138   12.462           44.61    26.766   32.147    14.303

Table 3: S and -L(p|W) values (both for the + and - classes) for the rules on the right
side of Table 1. Taking a look at the table, we cannot decide which is the best rule in global
terms: we can only establish a ranking per class (by using the support values) without
taking into account any other information.


3.3 Optimality 


Using the support as the sole criterion to rank the rules in W is useful provided there are
only rules belonging to one class. However, when there is more than one class in W, we need
to consider the purity or confidence of the rules. In the same spirit as the MML principle, we
define the optimality as the difference between the cost of coding a rule following equation
(13) for a specific class and the cost of coding the exceptions, i.e., the support of the rules
covered that belong to the other classes. We use a factor β indicating the relevance of rules
being as pure as possible. Formally:


opt c (p, W)±-/3- L c (p\W) -(l-/3)-J2 SAP, W) 


(14) 


e'ec 

c'^c 


leading to a generic optimality of a rule as: 


$$\mathrm{opt}(\rho, W) = \max_{c \in C}\, \mathrm{opt}_c(\rho, W) \qquad (15)$$

Continuing with the Family example, Table [4] shows the optimality values per class (the generic
optimality is the larger of the two) for the rules in the graph of Figure [2] using β = 0.5.
According to these values, rule 110 is the most significant rule, as can easily be seen in the
coverage graph, because it covers all the positive examples and no negative ones.

  ID    L(ρ)  class      S+      S-     -L+     -L-    opt+    opt-
   1  17.844      +  17.844     0.0     0.0 -17.844     0.0 -17.844
   2  17.844      +  17.844     0.0     0.0 -17.844     0.0 -17.844
   3  17.844      -     0.0  17.844 -17.844     0.0 -17.844     0.0
   4  17.844      -     0.0  17.844 -17.844     0.0 -17.844     0.0
   5  17.844      +  17.844     0.0     0.0 -17.844     0.0 -17.844
 100  11.977          8.922  26.766 -3.0549  14.788  -14.91   2.933
  59  18.791          8.922     0.0 -11.114 -20.036  -5.557 -14.479
  20  11.591          8.922     0.0  -2.668 -11.591  -1.334 -10.256
  35   9.284          8.922   8.922  -0.362  -0.362  -4.642  -4.642
  73  13.114         26.766     0.0  13.651 -13.114   6.825 -19.939
 110   9.962         35.688     0.0  25.726  -9.962  12.863 -22.825
 138  12.462          44.61  26.766  32.147  14.303    2.69 -15.153

Table 4: Optimality values (both for the + and - classes) for the rules on the right side of
Table [1]. The generic optimality (equation [15]) of each rule is the larger of its two per-class
values. Ranking the rules by optimality, we see that the best rule is 110.
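
As a minimal sketch of equations (14) and (15), the following Python fragment recomputes the optimality of rule 110 from its Table 4 values. The function names and the dictionary layout (one entry per class) are illustrative assumptions, not part of the original framework.

```python
from typing import Dict


def optimality(neg_cost: Dict[str, float], support: Dict[str, float],
               beta: float) -> Dict[str, float]:
    """Per-class optimality, equation (14).

    neg_cost[c] holds -L_c(rho|W) and support[c] holds S_c(rho, W).
    """
    result = {}
    for c in neg_cost:
        # Exceptions: support that the rule collects from the other classes.
        exceptions = sum(s for other, s in support.items() if other != c)
        result[c] = beta * neg_cost[c] - (1.0 - beta) * exceptions
    return result


def generic_optimality(per_class: Dict[str, float]) -> float:
    """Generic optimality, equation (15): the best per-class value."""
    return max(per_class.values())


# Rule 110 of the Family example (values copied from Table 4), beta = 0.5.
rule_110 = optimality({"+": 25.726, "-": -9.962},
                      {"+": 35.688, "-": 0.0}, beta=0.5)
best_110 = generic_optimality(rule_110)   # about 12.863, the opt+ value
```

Because the exceptions term is weighted by (1 - β), lowering β (as done later with β = 0.1) makes impure rules pay more for the support they take from other classes.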

4 Structuring knowledge: forgetting, promotion and 
demotion 

In our setting, rules are repeatedly generated by the inductive engine and added to the 
working space W. As an answer to the possible never-ending growth of W, it is necessary to 
have mechanisms for forgetting or revising useless pieces of acquired knowledge. Using the 
metrics we have just introduced, we need a mechanism to discard those rules that are not 
useful, are inconsistent or do not get enough support. 

4.1 Forgetting mechanism 

The optimality of a rule ρ is a core metric to determine its usefulness, but it is also important
to see whether ρ could be considered superfluous because it is covered (transitively or directly)
by another rule of higher optimality. If this is the case, ρ is mostly redundant and it could be
discarded safely. This idea leads to the following definition for the permanence of a rule:

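$$\mathrm{perm}_c(\rho, W) = \mathrm{opt}_c(\rho, W) - \max\Big(0,\; \max_{\nu \models \rho} \mathrm{opt}_c(\nu, W)\Big) \qquad (16)$$

where ν ranges over the rules in W that cover ρ,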


with a generic permanence: 


$$\mathrm{perm}(\rho, W) = \max_{c \in C}\, \mathrm{perm}_c(\rho, W) \qquad (17)$$

The lower the permanence of a rule, the more likely it is to be forgotten.
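
The permanence can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, and the set of covering rules is passed in explicitly (here only rule 110, which the Family example names as covering rule 59) rather than derived from a coverage graph.

```python
from typing import Dict, List


def permanence(opt: Dict[str, float],
               covering: List[Dict[str, float]]) -> Dict[str, float]:
    """Per-class permanence, equation (16).

    opt is the per-class optimality of the rule; covering lists the
    per-class optimality of every rule that covers it.
    """
    result = {}
    for c, value in opt.items():
        best_cover = max((cov[c] for cov in covering), default=0.0)
        result[c] = value - max(0.0, best_cover)
    return result


def generic_permanence(per_class: Dict[str, float]) -> float:
    """Generic permanence, equation (17)."""
    return max(per_class.values())


# Optimality values taken from Table 4 (beta = 0.5).
opt_59 = {"+": -5.557, "-": -14.479}
opt_110 = {"+": 12.863, "-": -22.825}

perm_59 = generic_permanence(permanence(opt_59, [opt_110]))   # about -14.479
```

The inner `max(0.0, ...)` means that a covering rule only penalises ρ when its own optimality is positive, so being covered by a poor rule does not make ρ look redundant.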

[Figure 4 panels: "Oblivion over a non-leaf node" (left), "Oblivion over a leaf node" (right).]

Figure 4: (Left) forgetting case (a): an internal node is forgotten (graphically represented
with a red cross). (Right) forgetting case (b): a leaf node is forgotten. Brown data storage
cylinders graphically represent the concept of “residual” that collects the support (for each
class) of forgotten leaf nodes.


When we perform a forgetting step, the coverage graph changes and so do the coverages.
In order to keep as much information as possible about the past support, each rule is provided
with a trace of its old support. In cognitive systems this is associated with notions such as the
preservation of belief and trust even if we forget the particular cases that gave support to a
given statement. Therefore, the forgetting mechanism works as follows:


1. If a non-leaf node is selected to be forgotten, the support of its successors has to be
redistributed among their ancestors and the ancestors of the forgotten node (see Figure
[4] (left)).

2. If a forgetting step removes a leaf node, its support has to be equally distributed among
the rules that cover it, which inherit it as their “residual” support value associated with
each class c (res_c) (see Figure [4] (right)).


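Hence, equation [12] is modified to include the residual:

$$S_c(\rho, W) \triangleq \begin{cases} L(\rho) & \text{if } \rho \in leaves_c \\ res_c + \sum_{\nu \in suc(\rho)} \dfrac{S_c(\nu, W)}{|anc(\nu)|} & \text{otherwise} \end{cases} \qquad (18)$$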


where res_c is initially set to 0. In each forgetting step, the support of the forgotten nodes is
distributed among the remaining nodes, increasing their res_c value. However, if the last
forgetting step removes a node with neither ancestors nor successors and with a non-zero res_c,
this value cannot be distributed and is, therefore, lost. This results in a decrease of the total
support of the graph: although the support remains conservative, the total amount will be
lower than the total support of the coverage graph before the forgetting steps. Consequently,
in the end some rules may have an under-estimated support value in terms of how many rules
(of different classes) they cover (see Figure [5]).

[Figure 5 panels: step=1, step=2, step=3.]

Figure 5: Forgetting mechanism performed over a complete branch. The conservative property
of the support measure holds in all steps, but the initial amount of support at step = 0
(S(ρ,W) = L(A)) has been reduced at the last step = 4 due to the forgetting mechanism.
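
The support with residual (equation (18)) and the leaf-forgetting case can be sketched as follows. This is a simplified illustration, not the authors' implementation: it handles a single class, only the leaf case (b), and the class `CoverageGraph` with its method names is an assumption made for the example.

```python
from collections import defaultdict


class CoverageGraph:
    """Single-class coverage graph with residual support (sketch of equation 18)."""

    def __init__(self):
        self.length = {}                # L(rho) for leaf (evidence) nodes
        self.suc = defaultdict(set)     # rule -> nodes it directly covers
        self.anc = defaultdict(set)     # node -> rules that directly cover it
        self.res = defaultdict(float)   # residual support, initially 0

    def add_leaf(self, name, length):
        self.length[name] = length

    def add_cover(self, rule, node):
        self.suc[rule].add(node)
        self.anc[node].add(rule)

    def support(self, node):
        if node in self.length:         # leaf: its own coding length
            return self.length[node]
        shared = sum(self.support(v) / len(self.anc[v]) for v in self.suc[node])
        return self.res[node] + shared

    def forget_leaf(self, leaf):
        share = self.length.pop(leaf)
        parents = self.anc.pop(leaf, set())
        for rule in parents:            # with no parents, the support is lost
            self.res[rule] += share / len(parents)
            self.suc[rule].discard(leaf)


g = CoverageGraph()
g.add_leaf("e1", 17.844)
g.add_leaf("e2", 17.844)
g.add_cover("r", "e1")
g.add_cover("r", "e2")
g.add_cover("q", "e2")

before = (g.support("r"), g.support("q"))   # about (26.766, 8.922)
g.forget_leaf("e2")
after = (g.support("r"), g.support("q"))    # same totals, now held as residual
```

Forgetting a leaf that has no covering rule leaves nothing to receive its share, which is the loss of total support described in the text.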


To clarify how this mechanism works, we illustrate it with the Family example.
Figure [6] shows the evolution of the coverage graph in Figure [2] and its measures (see Table [5])
through nine consecutive forgetting steps, where the rule with the lowest permanence is forgotten
in each step (shown with a grey square). For instance, in step 1, we see that rule number 59
is redundant because it is covered by a more significant rule (with ID 110), and it has the
lowest value of permanence (see Table [5] (step 1)). Thus, rule 59 is forgotten, the coverage
graph is redrawn (see Figure [6] (step 2)) and the metrics are recalculated if necessary (see
Table [5] (step 2)). In step 2 (and in other steps where a leaf node is deleted), the support of
the deleted node is distributed equally among its ancestors and this distributed support becomes
part of their residual or intrinsic support (res_c).

In this example, we have forgotten one rule at a time, but the actual pace and number 
of rules to forget can be tuned to the purpose of the system. 

4.2 Consolidated knowledge: promotion and demotion 

Finally, some of the rules with good indicators in the working space have to be eventually 
promoted to consolidated knowledge (or belief). This has to be a careful process, as the 
consolidated knowledge will be used by the deductive engine to calculate coverage. This 
means that an inconsistent rule that is promoted to the consolidated knowledge may have 
important consequences on the behaviour of the system. 

The promotion function can be tuned for the application, but a general choice is to use
a threshold θ_p on the optimality to consolidate or promote a rule to a belief status in B.

When a rule is promoted to consolidated knowledge, it can no longer be a target of the
forgetting mechanism and, hence, cannot be forgotten. However, it may happen that this rule
has to be eventually removed from the consolidated knowledge. Therefore, the promotion system
is mirrored by a demotion system, with the use of another threshold θ_d. The original background
knowledge (B_0) cannot be demoted (or forgotten).

Figure 6: Coverage Graph of the family problem. Green and red nodes refer to positive and
negative examples respectively. Nodes with a red cross represent the candidate rules to be
forgotten. Nodes with a thick blue square represent those rules that have been consolidated.


In the example in Figure [6] we have set θ_p equal to the average optimality of
all the rules in the working space. Then, in step 1, all the rules that exceed this average
value are consolidated into the background base (rules 110 and 73). Any rule that is
consolidated cannot be a target of the forgetting mechanism until it is demoted to the working
space again (in the example, we have considered a demotion threshold θ_d equal to θ_p). Thus,
in Table [5] (step 5), rule 73 has the lowest permanence value (perm(73) = -6.037) but rule 35
(perm(35) = -4.642) is forgotten instead, because the former is a consolidated rule.
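
The choice made in step 5 can be reproduced with a short Python sketch. The function name is hypothetical; the permanence values are those listed for step 5 in Table [5], and the consolidated set {73, 110} is the one named in the text.

```python
from typing import Dict, Optional, Set


def forgetting_candidate(perm: Dict[int, float],
                         consolidated: Set[int]) -> Optional[int]:
    """Return the non-consolidated rule with the lowest permanence, if any."""
    candidates = {rule: p for rule, p in perm.items() if rule not in consolidated}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Generic permanence of the rules still present at step 5 (Table 5).
perm_step5 = {3: -2.933, 4: -2.933, 100: 2.933, 20: -1.334,
              35: -4.642, 73: -6.037, 110: 10.173, 138: 2.69}

victim = forgetting_candidate(perm_step5, consolidated={73, 110})   # 35
```

Without the consolidated set the same call would pick rule 73, which is exactly the case the promotion mechanism is meant to protect.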

             | STEP 1                                                  | STEP 2                                                  | STEP 3
  ID    L(ρ) |      S+      S-     -L+     -L-    Opt+    Opt-    Perm |      S+      S-     -L+     -L-    Opt+    Opt-    Perm |      S+      S-     -L+     -L-    Opt+    Opt-    Perm
   1  17.844 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863 |       -       -       -       -       -       -       -
   2  17.844 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863
   3  17.844 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933
   4  17.844 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933
   5  17.844 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863
 100  11.977 |   8.922  26.766  -3.054  14.788  -14.91   2.933   2.933 |   8.922  26.766  -3.054  14.788  -14.91   2.933   2.933 |   8.922  26.766  -3.054  14.788  -14.91   2.933   2.933
  59  20.036 |   8.922     0.0 -11.114 -20.036  -5.557 -14.479 -14.479 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
  20  11.591 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334
  35   9.284 |   8.922   8.922  -0.362  -0.362  -4.642  -4.642  -4.642 |   8.922   8.922  -0.362  -0.362  -4.642  -4.642  -4.642 |   8.922   8.922  -0.362  -0.362  -4.642  -4.642  -4.642
  73  13.114 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037
 110   9.962 |  35.688     0.0  25.726  -9.962  12.863 -22.825  10.173 |  35.688     0.0  25.726  -9.962  12.863 -22.825  10.173 |  35.688     0.0  25.726  -9.962  12.863 -22.825  10.173
 138  12.462 |   44.61  26.766  32.147  14.304    2.69 -15.153    2.69 |   44.61  26.766  32.147  14.304    2.69 -15.153    2.69 |   44.61  26.766  32.147  14.304    2.69 -15.153    2.69

             | STEP 4                                                  | STEP 5                                                  | STEP 6
  ID    L(ρ) |      S+      S-     -L+     -L-    Opt+    Opt-    Perm |      S+      S-     -L+     -L-    Opt+    Opt-    Perm |      S+      S-     -L+     -L-    Opt+    Opt-    Perm
   1  17.844 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
   2  17.844 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
   3  17.844 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -7.394
   4  17.844 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -2.933 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -7.394
   5  17.844 |  17.844     0.0     0.0 -17.844     0.0 -17.844 -12.863 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
 100  11.977 |   8.922  26.766  -3.054  14.788  -14.91   2.933   2.933 |   8.922  26.766  -3.054  14.788  -14.91   2.933   2.933 |   8.922  26.766  -3.054  14.788 -19.371   7.394   7.394
  59  20.036 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
  20  11.591 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334
  35   9.284 |   8.922   8.922  -0.362  -0.362  -4.642  -4.642  -4.642 |   8.922   8.922  -0.362  -0.362  -4.642  -4.642  -4.642 |       -       -       -       -       -       -       -
  73  13.114 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037
 110   9.962 |  35.688     0.0  25.726  -9.962  12.863 -22.825  10.173 |  35.688     0.0  25.726  -9.962  12.863 -22.825  10.173 |  35.688     0.0  25.726  -9.962  12.863 -22.825  12.863
 138  12.462 |   44.61  26.766  32.147  14.304    2.69 -15.153    2.69 |   44.61  26.766  32.147  14.304    2.69 -15.153    2.69 |   44.61  26.766  32.147  14.304   -1.77 -10.691   -1.77

             | STEP 7                                                  | STEP 8                                                  | STEP 9
  ID    L(ρ) |      S+      S-     -L+     -L-    Opt+    Opt-    Perm |      S+      S-     -L+     -L-    Opt+    Opt-    Perm |      S+      S-     -L+     -L-    Opt+    Opt-    Perm
   1  17.844 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
   2  17.844 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
   3  17.844 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
   4  17.844 |     0.0  17.844 -17.844     0.0 -17.844     0.0  -7.394 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
   5  17.844 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
 100  11.977 |   8.922  26.766  -3.054  14.788 -19.371   7.394   7.394 |   8.922  26.766  -3.054  14.788 -19.371   7.394   7.394 |   8.922  26.766  -3.054  14.788 -19.371   7.394   7.394
  59  20.036 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
  20  11.591 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334 |   8.922     0.0  -2.668 -11.591  -1.334 -10.256  -1.334
  35   9.284 |       -       -       -       -       -       -       - |       -       -       -       -       -       -       - |       -       -       -       -       -       -       -
  73  13.114 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037 |  26.766     0.0  13.651 -13.114   6.825 -19.939  -6.037
 110   9.962 |  35.688     0.0  25.726  -9.962  12.863 -22.825  12.863 |  35.688     0.0  25.726  -9.962  12.863 -22.825  12.863 |  35.688     0.0  25.726  -9.962  12.863 -22.825  12.863
 138  12.462 |   44.61  26.766  32.147  14.304   -1.77 -10.691   -1.77 |   44.61  26.766  32.147  14.304   -1.77 -10.691   -1.77 |       -       -       -       -       -       -       -

Table 5: Metrics for the family problem in nine steps of forgetting. Identifiers refer to the
rules in Table [1]. Rows filled in blue refer to those rules which have been consolidated. Rows
filled in orange (non-consolidated rules with the lowest Perm) refer to the rule that is the
candidate to be forgotten. A dash marks a rule that has already been forgotten.


5 Experiments

As mentioned in section [1], one of the issues in many cognitive systems (especially
connectionist ones, either artificial or biological) is the Stability-Plasticity dilemma. We claim
that our approach is able to address this issue in a long-life learning process. For this purpose,
we have conducted an experimental evaluation to explore the following questions: (a) is it
possible to gradually generate a large repository of consolidated knowledge by assessing the
usefulness of the rules? (b) is our approach able to forget or revise the existing knowledge in
order to generate a rich and reusable knowledge base? and (c) how are the process and the
resulting knowledge structure understood in terms of cognitive systems that must discover
and develop knowledge incrementally? We want to illustrate these features in one single
domain. The ultimate goal of these experiments is to see whether the framework is general
enough to work with off-the-shelf inductive and deductive engines, to better understand how
the metrics and procedures work, and to find out whether they may require some tuning or
improvement to the framework before addressing other problems.

5.1 Methodology 

We will focus on the problem of learning the rules of chess by observation. In particular, we
focus on learning a model of the legal moves of different pieces from a set of legal and illegal
move examples (extracted from [58]). In our framework, the legal moves are the positive examples
and the illegal moves the negative ones (so we have two classes). Each example represents
a move of a specific piece on an empty board. Therefore, a move is represented by a triple
from the domain Piece × Pos × Pos, where the second and third components represent,
respectively, the piece’s initial position and its destination on a chessboard. Positions are
represented by a tuple from the domain File × Rank, where files (a-h) stand for columns
and ranks (1-8) stand for rows. For instance, Figure [7] illustrates all the possible moves of
a knight from a specific initial position (K) to several other positions (K'). We will use a
Prolog notation (as in the example in the previous section).

Figure 7: Possible moves of the knight from position (d,5). The particular legal move from
K to K' will be represented as move(knight,pos(d,5),pos(e,3)).


The only background predicate used is the absolute difference, diff(X, Y), that calculates 
the distance between X and Y, where both X and Y can be ranks or files (see Table [6]). 


ID    Rule
K1    project(a,1).
K2    project(b,2).
K3    project(c,3).
K4    project(d,4).
K5    project(e,5).
K6    project(f,6).
K7    project(g,7).
K8    project(h,8).
K9    abs(X,X) :- X>=0.
K10   abs(X,Y) :- X<0, Y is -X.
K11   rdiff(Rank1,Rank2,Diff) :-
          rank(Rank1), rank(Rank2),
          Diff1 is Rank1-Rank2,
          abs(Diff1,Diff).
K12   fdiff(File1,File2,Diff) :-
          file(File1), file(File2),
          project(File1,Rank1),
          project(File2,Rank2),
          Diff1 is Rank1-Rank2,
          abs(Diff1,Diff).


Table 6: Background knowledge for the chess problem. 
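
To make the background predicates concrete, the fragment below transcribes them into Python and checks the knight move of Figure [7]. It is only an illustrative sketch: the paper works in Prolog, the function `knight_move` is a hypothetical helper, and its two conditions are taken from the knight rules k22 and k24 reported in Table [9].

```python
FILES = "abcdefgh"


def project(file: str) -> int:
    """K1-K8: map a file letter to its number (a -> 1, ..., h -> 8)."""
    return FILES.index(file) + 1


def rdiff(rank1: int, rank2: int) -> int:
    """K11: absolute difference between two ranks."""
    return abs(rank1 - rank2)


def fdiff(file1: str, file2: str) -> int:
    """K12: absolute difference between two files, via project/2."""
    return abs(project(file1) - project(file2))


def knight_move(src, dst) -> bool:
    """Coverage test for the two knight rules (k22 and k24)."""
    (file1, rank1), (file2, rank2) = src, dst
    return (rdiff(rank1, rank2), fdiff(file1, file2)) in {(2, 1), (1, 2)}


# move(knight,pos(d,5),pos(e,3)) from Figure 7: rank difference 2, file difference 1.
covered = knight_move(("d", 5), ("e", 3))   # True
```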

The challenge we would like to face is knowledge discovery and acquisition in a progressive
way from examples provided incrementally. A random set of chess moves from all chess pieces
in the game except the pawn (rook, bishop, knight, queen and king) is given. This includes
positive and negative examples (28 and 12 examples respectively). We also consider that
an inductive engine is generating rules during the whole process (according to the working
space and using the consolidated knowledge as background knowledge) and that they arrive
at the system in a random order as well. In our case, we have taken the rules generated by
the ILP system Progol [59] (60 in total). The number of examples and rules given at each
step of the system follows a geometric distribution. Formally, the probability
that k examples (and similarly for rules) are given is Pr(X = k) = (1 - p)^(k-1) · p, where
k = 1, 2, 3, ... and p is the probability of success (we set it to 0.5). In order to better
mimic a situation where the inductive engine can produce rules it has already generated (as
otherwise we would need to keep track of all of them), it is more realistic to use this distribution
with replacement. Similarly, as the same move can appear repeatedly in chess, we have also
considered replacement for the set of examples.
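
The arrival process just described can be sketched as follows. The helper names, the fixed seed and the tiny example pool are assumptions made for the illustration; only the distribution Pr(X = k) = (1 - p)^(k-1) · p with p = 0.5 and the sampling with replacement come from the text.

```python
import random


def geometric(p: float, rng: random.Random) -> int:
    """Sample k in {1, 2, 3, ...} with Pr(X = k) = (1 - p)**(k - 1) * p."""
    k = 1
    while rng.random() >= p:    # each failed trial adds one more item
        k += 1
    return k


def next_batch(pool, p: float, rng: random.Random):
    """Draw a batch of geometric size from the pool, with replacement."""
    size = geometric(p, rng)
    return [rng.choice(pool) for _ in range(size)]


rng = random.Random(0)          # fixed seed, only to make the sketch repeatable
sizes = [geometric(0.5, rng) for _ in range(10000)]
mean_size = sum(sizes) / len(sizes)          # close to 1 / p = 2

batch = next_batch(["e1", "e2", "e3"], 0.5, rng)   # items may repeat
```

With p = 0.5 the expected batch size is 2, so most steps deliver one or two items and only occasionally a longer burst.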

In this experiment, we have set the consolidation criterion as a threshold: the optimality of
a rule must be greater than the average optimality of the rules in W (provided that it is above
the average optimality of the evidence), namely,

$$\mathrm{opt}(\rho, W) > \max\Big(0,\; \frac{\sum_{\nu \in W} \mathrm{opt}(\nu, W)}{|W|}\Big) \qquad (19)$$

Furthermore, since we want the consolidated knowledge to represent legal chess moves, we
have set the β parameter equal to 0.1 in equation [14] with the aim of penalising those rules
that are not pure.
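
A minimal sketch of the consolidation criterion of equation (19) is given below. The function names and the three optimality values in `working_space` are hypothetical and serve only as an illustration; the last line recomputes the opt+ value of rule r15 from the figures listed in Table [7] with β = 0.1.

```python
from typing import Dict, Set


def consolidation_threshold(opt_by_rule: Dict[str, float]) -> float:
    """Right-hand side of equation (19): max(0, average optimality in W)."""
    average = sum(opt_by_rule.values()) / len(opt_by_rule)
    return max(0.0, average)


def rules_to_consolidate(opt_by_rule: Dict[str, float]) -> Set[str]:
    """Rules whose generic optimality strictly exceeds the threshold."""
    theta = consolidation_threshold(opt_by_rule)
    return {rule for rule, value in opt_by_rule.items() if value > theta}


# Hypothetical generic optimality values, for illustration only.
working_space = {"a": 2.0, "b": -1.0, "c": 0.5}
selected = rules_to_consolidate(working_space)   # {'a'}: the mean is 0.5

# With beta = 0.1, rule r15 (Table 7: -L+ = 27.461, S- = 0) obtains:
opt_plus_r15 = 0.1 * 27.461 - 0.9 * 0.0          # about 2.746
```

The floor at 0 matters when most rules in W are poor: a negative average would otherwise let rules with negative optimality be promoted.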


5.2 Consolidation without forgetting 

In a first experiment we show what happens without the forgetting mechanism and check
whether the MML-based measures work successfully for knowledge acquisition: is the final
consolidated knowledge useful to solve the problem given the evidence? Figure [8] shows the
evolution of the learning process during 500 steps. As no rules are forgotten, the rule
population (dashed brown line) reaches its maximum value (100) and stagnates from step 180
onwards, ignoring any new evidence that arrives at the system (because it is already placed
in W). In this case we have assumed that all the evidence of the chess problem can be
allocated in W; however, it could be the case that all the knowledge of a problem does not fit
into W (memory restrictions), and the system would collapse with no improvement.
The same applies to both the average optimality of all rules (dashed blue line) and that of the
consolidated ones (dashed green line): since no more new rules are allocated into W, no
further learning or knowledge improvement can take place. Table [7] shows the consolidated
rules at step 500, where we can see that they represent almost all the legal chess moves (only
two movements of the knight are missing in this set) and there is only one rule (x20) which,
despite representing a legal move, does not completely generalise the movement of the piece
(king). This is a good result, as the working space is large enough to accommodate all these
rules (and many other less significant rules). See Table [12] in Appendix [T] for all the rules
in W at step 500 and Figure IT for their coverage relations. The conclusion we can draw
from these results is that the metrics used to measure the usefulness of the rules provide a
guarantee of promoting those rules that, having the maximum compression, best describe
the problem.


Figure 8: Evolution of some indicators for the chess problem without the forgetting mechanism:
#Examples and #Rules show the examples that arrive and the rules that are generated
by the inductive engine for each step, #Cons shows how many rules there are in the
consolidated knowledge (initially the background knowledge) and #Population shows the total
number of rules (magnitudes shown on the left y-axis). AvgOpt and AvgOptCons show,
respectively, the average optimality for all rules and the average optimality for all consolidated
rules (magnitudes shown on the right y-axis). This figure shows how, after the working
space is filled with all the evidence and generalised rules, the metrics become stable.


ID      L(ρ)       S+  S-      -L+      -L-    Opt+      Opt-     Perm  Rule
r15   22.133   49.594   0   27.461  -22.133   2.746   -46.847    2.746  move(rook,pos(A,B),pos(A,C)).
q19   13.214   27.052   0   13.838  -13.214   1.383   -25.668    1.383  move(queen,pos(A,B),pos(C,B)).
q12   13.214   27.052   0   13.838  -13.214   1.383   -25.668    1.383  move(queen,pos(A,C),pos(A,D)).
r16   20.455   33.815   0    13.36  -20.455   1.336   -32.479    1.336  move(rook,pos(A,B),pos(C,B)).
x18   34.275   40.578   0    6.303  -34.275    0.63   -39.947     0.63  move(king,pos(A,B),pos(C,D)) :- rdiff(B,D,1), fdiff(A,C,1).
x20   22.918   27.052   0    4.134  -22.918   0.413   -26.638    0.413  move(king,pos(A,B),pos(A,C)) :- rdiff(B,D,1).
x13   22.918   27.052   0    4.134  -22.918   0.413   -26.638    0.413  move(king,pos(A,B),pos(C,B)) :- fdiff(A,C,1).
q23   28.993   32.462   0    3.469  -28.993   0.346   -32.115    0.346  move(queen,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E).
b10   24.534     26.3   0    1.766  -24.534   0.176   -26.123    0.176  move(bishop,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E).

Table 7: Consolidated rules and metrics for the chess problem without the forgetting mechanism
at step 500. IDs in bold represent those rules that perfectly generalise the legal moves
of the chess pieces.


5.3 Consolidation with forgetting

After that, we repeat the same experiment, but using the forgetting mechanism. This tries to
represent a situation where we have bounded resources, in this case a more limited working
space, so it is necessary to forget rules in order to allocate new ones. What we want to
show is that, if our approach is able to find a solution to a certain problem without the
use of the forgetting mechanism, a suitable (and possibly better) solution to the problem
should exist with bounded resources and the forgetting mechanism. In order to do
that, we have executed several configurations with a varying maximum number of rules in the
working space (|W| ∈ {20, 30, 40, 50, 60, 70, 80, 90}); every time the limit is exceeded
the forgetting process is launched, forgetting up to 25%, 50% or 75% of the most meaningless
rules (those with the lowest perm value). Each different configuration has been launched 10
times; hence, there are 240 executions in total.
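
The capacity-triggered forgetting used in these configurations could look like the sketch below. Several details are assumptions, since the text does not fix them: the quota is taken as a fraction of the current size of W and rounded down, consolidated rules are skipped, and the function name and the permanence values are hypothetical.

```python
import math
from typing import Dict, List, Set


def forget_batch(perm: Dict[str, float], consolidated: Set[str],
                 max_rules: int, fraction: float) -> List[str]:
    """Rules to forget once the working space exceeds max_rules.

    At most floor(fraction * |W|) rules are returned, lowest permanence first.
    """
    if len(perm) <= max_rules:
        return []                        # limit not exceeded: nothing to do
    quota = math.floor(fraction * len(perm))
    candidates = sorted((rule for rule in perm if rule not in consolidated),
                        key=perm.get)
    return candidates[:quota]


# Hypothetical permanence values for a working space limited to 4 rules.
perm = {"a": -3.0, "b": 1.0, "c": -5.0, "d": 2.0, "e": -1.0}
forgotten = forget_batch(perm, consolidated={"c"}, max_rules=4, fraction=0.5)
# forgotten == ['a', 'e']: rule 'c' has the lowest value but is consolidated
```

The grid itself is 8 sizes × 3 percentages × 10 repetitions, which gives the 240 executions mentioned above.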



Table 8: Heat map showing the percentage of times a rule has been consolidated for each
different configuration (maximum number of rules, 20-100; percentage of rules forgotten,
25%-75%). Each cell represents 10 repetitions. The last row (|W| = 100) represents the
reference solution, namely, the solution obtained by the experiment without forgetting
(previous section). Rules (ρ) in bold are those rules that belong to the solution of the problem.
As can be seen, even with very limited resources, the consolidated knowledge improves the
reference solution.

Table [8] is a heat map showing, for each possible configuration (|W| × forgetting (%)),
how many times a specific rule appears in the consolidated knowledge in 10 repetitions, from
white (0 times) and light yellow (1 time) to dark green (10 times). Rules that are not
represented in the heat map have not been consolidated at any time. Knowing that the
rules consolidated in the first experiment (Table [7]) are those represented in the bottom
row (|W| = 100), it is easy to see that not only does the set of consolidated rules almost always
include the reference solution (even with very limited resources), but also the forgetting
criterion allows the system to include those rules that perfectly generalise the moves of the
king (rules in bold). The rest of the rules included in the consolidated set in each experiment
also generalise different movements of the pieces and, in some cases, they could disappear
from this set by using a more restrictive consolidation criterion (e.g., by using the average of
the optimality plus n times its standard deviation). See Table [13] in Appendix [?] for all the
rules in W at step 500.

In order to compare both experiments, Figure [9] shows the evolution of the system during
500 steps for one representative setting of the 24 configurations (maximum number of rules
equal to 60 and up to 50% of rules forgotten in each forgetting step). Now, the variations in
the number of consolidated rules (dotted black line) and rules in the working space (dashed
brown line) allow us to observe how the forgetting mechanism works (every 30 steps
approximately). Table [9] presents the consolidated rules at the final step (500). In this case, this
set perfectly generalises all the legal moves of all the chess pieces. The system has reached a
stable situation in which the number of consolidated rules (dotted black line) remains almost
constant from step 250. The average optimality of both the consolidated rules (dashed green
line) and all the rules (dashed blue line) has an increasing trend due to the distribution
with replacement used to populate the working space. The appearance of new rules in the
system or the execution of the forgetting mechanism mainly affects the average optimality of
W (dashed blue line): every time the mechanism runs, the working space is cleaned of useless
rules, which strongly affects the metrics of the rules in W (and, to a lesser extent, those of the
consolidated set of rules (green line)), which have to be recalculated. Compared with the former
experiment, the number of rules in W has been reduced (with a speedup of one order of
magnitude (10x) in execution) while obtaining a better set of consolidated knowledge: it includes
all the rules that solve the chess problem, including the two legal moves of the knight, rules
k22 and k24, which were missing from the consolidated knowledge in the first experiment.



Figure 9: Evolution of the same indicators as in Figure [8] for the chess problem with the 
forgetting mechanism (for a configuration with maximum number of rules 60 and up to 
50% of rules forgotten for each forgetting step). Now we see a bumpier picture, where the 
forgetting mechanism takes place every 30 steps approximately. 


ID      L(ρ)       S+  S-      -L+      -L-    Opt+      Opt-     Perm  Rule
x18   34.275  662.774   0  628.499  -34.275  62.849  -599.924  -22.681  move(king,pos(A,B),pos(C,D)) :- rdiff(B,D,1), fdiff(A,C,1).
x13   22.918  459.884   0  436.966  -22.918  43.696  -416.187  -41.834  move(king,pos(A,B),pos(C,B)) :- fdiff(A,C,1).
k22   34.275  446.358   0  412.083  -34.275  41.208  -405.149   29.394  move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,2), fdiff(A,C,1).
q23   28.993   437.34   0  408.347  -28.993  40.834  -396.505   -1.513  move(queen,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E).
x20   22.918  392.254   0  369.336  -22.918  36.933   -355.32    8.116  move(king,pos(A,B),pos(A,C)) :- rdiff(B,D,1).
r15   22.133  389.999   0  367.866  -22.133  36.786  -353.212    6.602  move(rook,pos(A,B),pos(A,C)).
q19   13.214  365.202   0  351.988  -13.214  35.198  -330.003   -7.149  move(queen,pos(A,B),pos(C,B)).
k24   34.275   369.71   0  335.435  -34.275  33.543  -336.166   25.561  move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,1), fdiff(A,C,2).
q12   13.214  311.098   0  297.884  -13.214  29.788  -281.309  -12.559  move(queen,pos(A,C),pos(A,D)).
r16   20.455  284.046   0  263.591  -20.455  26.359  -257.686   -3.824  move(rook,pos(A,B),pos(C,B)).
b10   24.534  275.028   0  250.494  -24.534  25.049  -249.978   12.731  move(bishop,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E).

Table 9: Consolidated rules and metrics (as in Table [7]) for the chess problem with the
forgetting mechanism at step 500 (for a configuration with maximum number of rules 60
and up to 50% of rules forgotten for each forgetting step). IDs in bold represent those rules
that perfectly generalise the legal moves of the chess pieces.


5.4 Incremental knowledge acquisition

Finally, one last experiment tries to show the capability of our approach for the incremental
learning of new knowledge from previously consolidated concepts. This experiment is divided
into two phases. In the first one we have only taken rules and examples of moves of the rook
and bishop chess pieces (15 and 30 rules respectively), providing the system with them in the
same way as in the previous experiment. The consolidation criterion has not been changed,
but the maximum number of rules in the working space has been set to 15 (in order
to allow the forgetting mechanism to work) and the percentage of meaningless rules that are
forgotten in each forgetting process to up to 25%, due to the smaller size of the working set.
In Table [10] we can see the set of consolidated rules after 100 steps. This set contains the
rules that perfectly generalise all the legal moves of the rook and the bishop. In the first 100
steps of Figure [10] we can see how the forgetting and consolidation mechanisms work. This
time, due to the lower maximum number of rules allowed in the working space, the lower
percentage of rules forgotten and the geometric distribution used to provide the rules, the
forgetting mechanism runs every few steps, producing an irregular sawtooth-like pattern
for the number of rules in the working space (dashed brown). However, the number
of consolidated rules remains almost constant from step 45 to the end of this stage (100).


ID | Rule | L(ρ) | S+ | S- | -L+ | -L- | Opt+ | Opt- | Perm
b10 | move(bishop,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E). | 24.534 | 590.635 | 0 | 566.101 | -24.534 | 56.61 | -534.024 | 24.904
r15 | move(rook,pos(A,B),pos(A,C)). | 22.133 | 277.282 | 0 | 255.149 | -22.133 | 25.514 | -251.767 | 25.514
r16 | move(rook,pos(A,B),pos(C,B)). | 20.455 | 223.178 | 0 | 202.723 | -20.455 | 20.272 | -202.905 | 20.272
r7 | move(rook,pos(A,2),pos(B,2)). | 19.718 | 175.838 | 0 | 156.12 | -19.718 | 15.612 | -160.226 | -4.659
r14 | move(rook,pos(A,2),pos(C,D)). | 21.133 | 162.311 | 0 | 141.178 | -21.133 | 14.117 | -148.193 | 14.117
r9 | move(rook,pos(a,B),pos(h,B)). | 19.133 | 121.733 | 0 | 102.6 | -19.133 | 10.26 | -111.473 | -10.011


Table 10: Consolidated rules and metrics (as in Table [7]) for the chess problem (rook + 
bishop moves) at step 100. All rook and bishop legal moves are covered by these rules and 
no better rules can be obtained. 
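
As a reading aid for the metric columns, the following sketch checks three arithmetic
relations that the values tabulated in Table 10 happen to satisfy: the fourth and fifth
numeric columns equal S+ and S- minus the rule length, and the difference between the two
Opt columns equals the difference between S+ and S-. These relations are only an
observation on the numbers above, not the definitions of the metrics (see Table [7]); the
variable names and the tolerance are assumptions chosen for illustration.

```python
# (id, length, s_pos, s_neg, col4, col5, opt_pos, opt_neg), copied from Table 10.
TABLE_10 = [
    ("b10", 24.534, 590.635, 0.0, 566.101, -24.534, 56.61, -534.024),
    ("r15", 22.133, 277.282, 0.0, 255.149, -22.133, 25.514, -251.767),
    ("r16", 20.455, 223.178, 0.0, 202.723, -20.455, 20.272, -202.905),
    ("r7", 19.718, 175.838, 0.0, 156.12, -19.718, 15.612, -160.226),
    ("r14", 21.133, 162.311, 0.0, 141.178, -21.133, 14.117, -148.193),
    ("r9", 19.133, 121.733, 0.0, 102.6, -19.133, 10.26, -111.473),
]


def check_row(row, tol=0.005):
    """Return True if the row satisfies the three observed relations."""
    _, length, s_pos, s_neg, col4, col5, opt_pos, opt_neg = row
    return (
        abs((s_pos - length) - col4) <= tol       # col4 = S+ - length
        and abs((s_neg - length) - col5) <= tol   # col5 = S- - length
        and abs((opt_pos - opt_neg) - (s_pos - s_neg)) <= tol
    )


for row in TABLE_10:
    print(row[0], check_row(row))
```

The tolerance absorbs the last-digit truncation of the printed values.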

In the second phase, we provided the system with a new set of rules and examples (10 and
20, respectively) representing only moves of the queen. Apart from using the background
knowledge that is provided initially, it should also be possible at this point to use the
previously learned moves of the rook and the bishop in order to express the moves of the
queen. This is what the inductive engine can take advantage of. Table [11] shows the set of
consolidated rules, which contains the previously consolidated rules that generalise the
legal moves of the rook and bishop, and a new set of rules that represents the legal moves of
the queen. This latter set includes a pair of rules (q29 and q25) that use the rook and bishop
rules and represent all the possible moves of the queen: q25, which covers both the
horizontal and vertical moves of the queen, and q29, which covers the diagonal moves.
The second half of Figure [10] (from step 100) shows how the forgetting mechanism runs even
more frequently than before (dashed brown) due to the increase in the number of consolidated
rules (which cannot be targeted by forgetting). Again, the number of consolidated rules
(dotted black line) remains constant most of the time (from step 140 to step 200).

Figure 10: Evolution of the same indicators as in Figure [8] for the incremental chess problem
(rook and bishop moves in the first 100 steps, and queen moves in the following 100 steps)
with the forgetting mechanism. The number of rules in the working space shows an irregular
sawtooth pattern: forgetting takes place every few steps because of the small number of
rules allowed and the low percentage of rules forgotten in every forgetting step.
Nonetheless, the number of consolidated rules becomes constant in each learning process.


ID | Rule | L(ρ) | S+ | S- | -L+ | -L- | Opt+ | Opt- | Perm
b10 | move(bishop,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E). | 34.275 | 590.635 | 0 | 566.101 | -24.534 | 56.61 | -534.024 | 56.61
q29 | move(queen,pos(A,B),pos(C,D)) :- move(bishop,pos(A,B),pos(C,D)). | 34.275 | 432.832 | 0 | 405.116 | -27.716 | 40.511 | -392.32 | 40.511
q25 | move(queen,pos(A,B),pos(C,D)) :- move(rook,pos(A,B),pos(C,D)). | 22.133 | 417.051 | 0 | 389.335 | -27.716 | 38.933 | -378.117 | 38.933
r15 | move(rook,pos(A,B),pos(A,C)). | 22.918 | 277.282 | 0 | 255.149 | -22.133 | 25.514 | -251.767 | 25.514
r16 | move(rook,pos(A,B),pos(C,B)). | 34.275 | 223.178 | 0 | 202.723 | -20.455 | 20.272 | -202.905 | 20.272
q12 | move(queen,pos(A,C),pos(A,D)). | 13.214 | 173.583 | 0 | 160 | -13.214 | 16.036 | -157.546 | 16.036
r7 | move(rook,pos(A,2),pos(B,2)). | 28.993 | 175.838 | 0 | 156.12 | -19.718 | 15.612 | -160.226 | -4.659
r14 | move(rook,pos(A,2),pos(C,D)). | 22.918 | 162.311 | 0 | 141 | -21.133 | 14.117 | -148.193 | 14.117
r9 | move(rook,pos(a,B),pos(h,B)). | 24.534 | 121.733 | 0 | 102.6 | -19.133 | 10.26 | -111.473 | -10.011


Table 11: Consolidated rules and metrics (as in Table [7]) for the chess problem (queen moves)
at step 200 (the first 100 steps for learning the rook and bishop moves, and the following
100 steps for learning the moves of the queen). All legal moves of the queen are covered by
taking advantage of previously learned moves of the rook and bishop, whose legal moves are
also covered by this set.
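
To make the reuse of consolidated knowledge concrete, the sketch below transcribes rules
r15, r16, b10, q25 and q29 into Python predicates and checks them against an independent
enumeration of queen moves on an empty 8x8 board. It assumes that rdiff and fdiff denote the
absolute rank and file differences (their definition belongs to the background knowledge and
is not restated here), and it encodes files as indices 0 to 7. As written, the rules also
accept the null move (source equal to destination), so the reference enumeration includes it.

```python
SQUARES = [(f, r) for f in range(8) for r in range(1, 9)]  # (file index, rank)


def fdiff(a, c):
    return abs(a - c)  # assumed meaning: absolute file difference


def rdiff(b, d):
    return abs(b - d)  # assumed meaning: absolute rank difference


def rook(src, dst):
    (a, b), (c, d) = src, dst
    return a == c or b == d  # r15 (same file) or r16 (same rank)


def bishop(src, dst):
    (a, b), (c, d) = src, dst
    return rdiff(b, d) == fdiff(a, c)  # b10: equal rank and file differences


def queen(src, dst):
    return rook(src, dst) or bishop(src, dst)  # q25 or q29


def reference_queen_targets(src):
    """Squares reached by sliding in the eight directions, plus src itself."""
    f0, r0 = src
    targets = {src}
    for df in (-1, 0, 1):
        for dr in (-1, 0, 1):
            if df == 0 and dr == 0:
                continue
            for k in range(1, 8):
                f, r = f0 + k * df, r0 + k * dr
                if 0 <= f < 8 and 1 <= r <= 8:
                    targets.add((f, r))
    return targets


covered = {(s, t) for s in SQUARES for t in SQUARES if queen(s, t)}
reference = {(s, t) for s in SQUARES for t in reference_queen_targets(s)}
print(covered == reference)
```

Only two short rules are needed for the queen because the geometric conditions already
live in the rook and bishop rules consolidated during the first phase.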
5.5 Discussion

To understand whether our approach is able to grow a knowledge base effectively and
incrementally, by using appropriate evaluation metrics and useful cognitive abilities for
managing the knowledge acquired, we have performed some experiments over a well-known
domain, the chess problem. As we have said, the ultimate goal is not to validate the
approach but to provide some insight into its generality, its efficiency and the
much-needed use of forgetting and consolidation as cognitive procedures in incremental
and developmental approaches for knowledge discovery. In order to shed some light on these
aspects, we will refer to the questions raised at the beginning of this section.

From the above experiments, we see that the repository of rules can be well structured
and ranked by the metrics, and that the system consolidates those rules that are
appropriate, therefore responding affirmatively to the first question (a). Regarding
question (b), we also see that a moderate limitation of the working space with forgetting
can even improve the identification of the rules to be consolidated and, what is better,
prevents the system from stagnating or collapsing in situations where we have bounded
resources. Finally, in connection with question (c), we see the behaviour in an incremental
setting, where the knowledge can be used in new tasks, one of the principles of
developmental cognition.

Consequently, the proposed approach for knowledge acquisition is a favourable compromise
to the stability-plasticity dilemma, which is characterised as follows:

• Too much plasticity will result in previously learned knowledge being constantly
forgotten. However, the promotion and demotion mechanisms, together with the evaluation
metrics, rank and structure the knowledge allocated in the working space, avoiding the
loss of useful knowledge.

• Too much stability will impede the efficient coding of newly learnt knowledge. However,
the forgetting mechanism, also together with the evaluation metrics, is in charge of
removing meaningless and redundant knowledge.

6 Conclusions 

Learning a set of rules from data is nowadays a well-known problem for which many
approaches exist. However, the use of background knowledge and the consolidation of new
knowledge remain conspicuous problems in the understanding and creation of cognitive
systems, and in the management of long-lived knowledge discovery systems. The organisation
of complex knowledge structures in terms of coverage graphs allows a straightforward
and principled approach to knowledge acquisition, consolidation (promotion), revision
(demotion) and forgetting. All this can be applied and analysed at a meta-level, with the
use of off-the-shelf deductive and inductive engines. This modularity, and the ability to
deal with declarative knowledge bases, opens up a range of applications in knowledge
discovery, developmental cognition, expert systems and other intelligent systems that are
meant to have a non-ephemeral life.

The main contributions of this work are: (1) The first extension of the MML principle
to a knowledge network (in the form of a coverage graph). While the MML principle has a
Bayesian inspiration, the metrics are more flexible than actual probabilities, more robust
when pieces of the working space are removed, and can be combined into metrics for different
processes. (2) We show that the development of a formal epistemology to support knowledge
discovery, in terms of how the knowledge can be acquired and justified, supports a
constructive and developmental way to define appropriate knowledge acquisition processes.
In particular, we have seen how cognitive procedures such as forgetting are not only
necessary when the working space is finite but can even be beneficial in our setting. (3)
Our approach is parametrisable to other cognitive or intelligent systems, as it works at a
meta-level and is independent of the actual deductive and inductive mechanisms that are
used underneath. (4) The nonmonotonicity problem of knowledge acquisition and revision is
approached in a more lightweight and robust way, and the system can cope with redundancy
and even inconsistency without heavy conflict resolution or complex semantic artifacts. (5)
The problem of catastrophic forgetting and, thus, the stability-plasticity dilemma has been
effectively overcome when acquiring knowledge, allowing our approach not only to gain new
knowledge but also to manage it efficiently. (6) Its adaptive and off-the-shelf
characteristics allow our approach to be fed with dynamic data in real time, or near real
time.

Given the flexibility of the approach, we consider many avenues of future work. We plan
to apply the setting to some other applications, by using the same or other deductive and
inductive engines, and to continue with the integration into our learning system gErl [52].
Furthermore, the application of the principles used (MML evaluations and cognitive
mechanisms) to other kinds of AI systems, such as decision support systems, is also of
interest, in order to help them make better decisions based on the best available data.
Finally, two further desirable characteristics for our approach are also likely to be part
of our future research: (a) interactiveness, namely, the ability to find an additional
(human or not) source of input if a problem statement is ambiguous or incomplete; (b)
contextuality, namely, the ability to identify, understand and extract contextual elements
such as syntax, semantics, domain, time, location, goal, ..., which may be useful to move
beyond the current knowledge acquisition systems.
7 Appendix



Figure 11: Coverage graph that represents the coverage relations between the individuals for 
the chess problem (without oblivion) at step 400. The metrics are shown in Table 12. Green 
and red nodes refer to positive and negative examples respectively. Original background 
knowledge is represented as blue nodes. 
-34.618

-8.871 

-35.923 

-8.871 

k14 | move(knight,pos(A,B),pos(C,D)) :- fdiff(A,C,1). | 23.884 | 11.271 | 13.526 | -12.613 | -10.358 | -13.434 | -11.179 | -11.179
q15 | move(queen,pos(A,B),pos(C,D)) :- fdiff(A,C,3). | 23.884 | 32.462 | 13.526 | 8.578 | -10.358 | -11.315 | -30.251 | -11.315
k16 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,1), fdiff(A,A,0). | 29.816 | 27.052 | 13.526 | -2.764 | -16.29 | -12.449 | -25.975 | -12.449
x16 | move(king,pos(A,B),pos(C,D)) :- rdiff(B,D,0), fdiff(B,B,0). | 34.275 | 27.052 | 13.526 | -7.223 | -20.749 | -12.895 | -26.421 | -12.895
x17 | move(king,pos(A,B),pos(C,D)) :- rdiff(B,D,0), fdiff(A,C,0). | 34.275 | 27.052 | 13.526 | -7.223 | -20.749 | -12.895 | -26.421 | -12.895
k18 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,B,0), rdiff(B,D,1), fdiff(A,A,0). | 43.378 | 27.052 | 13.526 | -16.326 | -29.852 | -13.806 | -27.332 | -13.806
x14 | move(king,pos(A,B),pos(C,D)) :- fdiff(A,C,1). | 23.884 | 67.63 | 27.052 | 43.746 | 3 | -19.972 | -61 | -19.972
x15 | move(king,pos(A,B),pos(C,D)) :- rdiff(B,B,0), rdiff(B,D,1), rdiff(D,D,0). | 43.378 | 67.63 | 27.052 | 24.252 | -16.326 | -21.921 | -62.499 | -21.921
k20 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,B,0), fdiff(A,A,0). | 23.884 | 60.867 | 36 | 36.983 | 12.185 | -28.763 | -53.561 | -28.763
q21 | move(queen,pos(A,B),pos(C,D)) :- rdiff(B,D,E). | 21.506 | 86.566 | 49.595 | 65.06 | 28.089 | -38.129 | -75.1 | -38.129
r17 | move(rook,pos(A,B),pos(C,D)). | 22.777 | 108.205 | 54.104 | 85.428 | 31.327 | -40.15 | -94.251 | -40.15
k25 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,F). | 30.105 | 129.624 | 57.109 | 99.519 | 27.004 | -41.446 | -113.961 | -41.446
k17 | move(knight,pos(A,B),pos(C,D)) :- fdiff(A,C,E). | 17.047 | 154.421 | 73.64 | 137.374 | 56.593 | -52.538 | -133.319 | -52.538
q24 | move(queen,pos(A,B),pos(C,D)). | 13.858 | 189.362 | 81.154 | 175.504 | 67.296 | -55.488 | -163.696 | -55.488
x19 | move(king,pos(A,B),pos(C,D)). | 22.918 | 189.363 | 108.208 | 166.445 | 85.29 | -80.742 | -161.897 | -80.742


Table 12: The complete set of rules and their metrics (same as in Table [7]) for the chess
problem without the oblivion mechanism at step 500. Consolidated rules are placed in the
table at the top while the rest of rules in W are placed in the table at the bottom.

Consolidated rules:

ID | Rule | L(ρ) | S+ | S- | -L+ | -L- | Opt+ | Opt- | Perm
x18 | move(king,pos(A,B),pos(C,D)) :- rdiff(B,D,1), fdiff(A,C,1). | 34.275 | 662.774 | 0 | 628.499 | -34.275 | 62.849 | -599.924 | -22.681
x13 | move(king,pos(A,B),pos(C,B)) :- fdiff(A,C,1). | 22.918 | 459.884 | 0 | 436.966 | -22.918 | 43.696 | -416.187 | -41.834
k22 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,2), fdiff(A,C,1). | 34.275 | 446.358 | 0 | 412.083 | -34.275 | 41.208 | -405.149 | 29.394
q23 | move(queen,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E). | 28.993 | 437.34 | 0 | 408.347 | -28.993 | 40.834 | -396.505 | -1.513
x20 | move(king,pos(A,B),pos(A,C)) :- rdiff(B,D,1). | 22.918 | 392.254 | 0 | 369.336 | -22.918 | 36.933 | -355.32 | 8.116
r15 | move(rook,pos(A,B),pos(A,C)). | 22.133 | 389.999 | 0 | 367.866 | -22.133 | 36.786 | -353.212 | 6.602
q19 | move(queen,pos(A,B),pos(C,B)). | 13.214 | 365.202 | 0 | 351.988 | -13.214 | 35.198 | -330.003 | -7.149
k24 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,1), fdiff(A,C,2). | 34.275 | 369.71 | 0 | 335.435 | -34.275 | 33.543 | -336.166 | 25.561
q12 | move(queen,pos(A,C),pos(A,D)). | 13.214 | 311.098 | 0 | 297.884 | -13.214 | 29.788 | -281.309 | -12.559
r16 | move(rook,pos(A,B),pos(C,B)). | 20.455 | 284.046 | 0 | 263.591 | -20.455 | 26.359 | -257.686 | -3.824
b10 | move(bishop,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,E). | 24.534 | 275.028 | 0 | 250.494 | -24.534 | 25.049 | -249.978 | 12.731

Rest of rules in W:

ID | Rule | L(ρ) | S+ | S- | -L+ | -L- | Opt+ | Opt- | Perm
x14 | move(king,pos(A,B),pos(C,D)) :- fdiff(A,C,1). | 23.884 | 1122.658 | 27.052 | 1.099 | 3.168 | 85.53 | -1010.075 | 56.713
q24 | move(queen,pos(A,B),pos(C,D)). | 13.858 | 1167.744 | 81.156 | 1.154 | 67.298 | 42.348 | -1044.239 | 42.348
r17 | move(rook,pos(A,B),pos(C,D)). | 22.777 | 811.559 | 54.104 | 788.782 | 31.327 | 30.184 | -727.27 | 30.184
x19 | move(king,pos(A,B),pos(C,D)). | 22.918 | 1528.437 | 135.26 | 1505.519 | 112.342 | 28.817 | -1364.359 | 28.817
b12 | move(bishop,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,F). | 30.105 | 275.028 | 13.526 | 244.923 | -16.579 | 12.318 | -249.183 | 12.318
k12 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,2). | 23.884 | 223.179 | 9.017 | 199.295 | -14.867 | 11.814 | -202.347 | 11.814
k19 | move(knight,pos(A,B),pos(C,D)) :- fdiff(A,C,2). | 23.884 | 184.855 | 9.017 | 160.971 | -14.867 | 7.981 | -167.856 | 7.981
b5 | move(bishop,pos(A,B),pos(C,D)) :- fdiff(A,C,2). | 23.884 | 78.901 | 0 | 55.017 | -23.884 | 5.501 | -73.399 | 5.501
x12 | move(king,pos(A,B),pos(C,B)) :- fdiff(A,C,1). | 22.918 | 0 | 54.104 | -22.918 | 31.186 | -50.985 | 3.118 | 3.118
r13 | move(rook,pos(d,B),pos(d,5)). | 18.633 | 27.052 | 0 | 8.419 | -18.633 | 0.841 | -26.21 | -26.21
k1 | move(knight,pos(b,2),pos(c,4)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
k7 | move(knight,pos(f,1),pos(g,3)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
k5 | move(knight,pos(f,1),pos(g,3)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
x3 | move(king,pos(b,5),pos(c,5)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
q4 | move(queen,pos(b,2),pos(e,2)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
r3 | move(rook,pos(d,2),pos(d,5)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
x6 | move(king,pos(f,6),pos(g,7)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
r2 | move(rook,pos(a,5),pos(h,5)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
r4 | move(rook,pos(a,4),pos(a,5)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -27.052
b2 | move(bishop,pos(c,4),pos(e,6)). | 27.052 | 27.052 | 0 | 0 | -27.052 | 0 | -27.052 | -25.049
x9 | ¬move(king,pos(b,2),pos(c,4)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27.052 | 0 | 0
b4 | ¬move(bishop,pos(f,1),pos(b,3)). | 27.052 | 0 | 27.052 | -27 | 0 | -27 | 0 | 0
x10 | ¬move(king,pos(b,2),pos(b,4)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27 | 0 | 0
k9 | ¬move(knight,pos(f,1),pos(c,2)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27.052 | 0 | 0
r6 | ¬move(rook,pos(h,4),pos(a,2)). | 27.052 | 0 | 27.052 | -27 | 0 | -27.052 | 0 | 0
q8 | ¬move(queen,pos(f,1),pos(b,3)). | 27.052 | 0 | 27.052 | -27 | 0 | -27.052 | 0 | 0
k10 | ¬move(knight,pos(f,1),pos(h,3)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27.052 | 0 | 0
q10 | ¬move(queen,pos(e,1),pos(b,8)). | 27.052 | 0 | 27.052 | -27 | 0 | -27.052 | 0 | 0
k8 | ¬move(knight,pos(f,1),pos(g,4)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27.052 | 0 | 0
x8 | ¬move(king,pos(b,2),pos(d,3)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27.052 | 0 | 0
r5 | ¬move(rook,pos(f,1),pos(b,3)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27.052 | 0 | 0
q9 | ¬move(queen,pos(f,1),pos(b,4)). | 27.052 | 0 | 27.052 | -27.052 | 0 | -27.052 | 0 | 0
q13 | move(queen,pos(A,B),pos(A,B)). | 12.384 | 0 | 0 | -12.384 | -12.384 | -1.238 | -1.238 | -1.238
r8 | move(rook,pos(b,B),pos(e,B)). | 19.133 | 0 | 0 | -19.133 | -19.133 | -1.913 | -1.913 | -1.913
q21 | move(queen,pos(A,B),pos(C,D)) :- rdiff(B,D,E). | 21.506 | 13.526 | 81.156 | -7.98 | 59.65 | -73.838 | -6.208 | -6.208
b11 | move(bishop,pos(A,B),pos(C,D)) :- rdiff(B,B,0), rdiff(D,D,0), | 43.635 | 9.017 | 13.526 | -34.618 | -30.109 | -15.635 | -11.126 | -11.126
x15 | move(king,pos(A,B),pos(C,D)) :- rdiff(B,B,0), rdiff(B,D,1), | 43.378 | 13.526 | 27.052 | -30 | -16.326 | -27.332 | -13.806 | -13.806
k25 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,D,E), fdiff(A,C,F). | 30.105 | 448.611 | 63.121 | 418.506 | 33.016 | -14.958 | -400.448 | -14.958
k20 | move(knight,pos(A,B),pos(C,D)) :- rdiff(B,B,0), fdiff(A,A,0). | 23.884 | 40.578 | 63.121 | 16.694 | 39.237 | -55.139 | -32.596 | -32.595


Table 13: Rules and metrics (same as in Table [7]) for the chess problem with the oblivion 
mechanisms at step 500. Consolidated rules are placed in the table at the top while the rest 
of rules in W are placed in the table at the bottom. 